AITopics | bias error

Collaborating Authors

bias error

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

ImprovingSampleComplexityBoundsfor(Natural) Actor-CriticAlgorithms

Neural Information Processing SystemsFeb-7-2026, 23:15:29 GMT

The goal of reinforcement learning (RL) [39] is to maximize the expected total reward by taking actions according toapolicyinastochastic environment, whichismodelled asaMarkovdecision process (MDP) [4]. To obtain an optimal policy, one popular method is the direct maximization of the expected total reward via gradient ascent, which is referred to as the policy gradient (PG) method [40,47].

artificial intelligence, machine learning, reinforcement learning, (16 more...)

Neural Information Processing Systems

Country:

North America > Canada > Alberta (0.14)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization

Wu, Jingfeng, Bartlett, Peter L., Lee, Jason D., Kakade, Sham M., Yu, Bin

arXiv.org Machine LearningSep-23-2025

Existing theory suggests that for linear regression problems categorized by capacity and source conditions, gradient descent (GD) is always minimax optimal, while both ridge regression and online stochastic gradient descent (SGD) are polynomially suboptimal for certain categories of such problems. Moving beyond minimax theory, this work provides instance-wise comparisons of the finite-sample risks for these algorithms on any well-specified linear regression problem. Our analysis yields three key findings. First, GD dominates ridge regression: with comparable regularization, the excess risk of GD is always within a constant factor of ridge, but ridge can be polynomially worse even when tuned optimally. Second, GD is incomparable with SGD. While it is known that for certain problems GD can be polynomially better than SGD, the reverse is also true: we construct problems, inspired by benign overfitting theory, where optimally stopped GD is polynomially worse. Finally, GD dominates SGD for a significant subclass of problems -- those with fast and continuously decaying covariance spectra -- which includes all problems satisfying the standard capacity condition.

proposition 2, regression, ridge regression, (14 more...)

arXiv.org Machine Learning

2509.17251

Country:

North America > United States > California > Alameda County > Berkeley (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report > New Finding (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.88)

Add feedback

Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression

Li, Xuheng, Gu, Quanquan

arXiv.org Machine LearningFeb-19-2025

Exponential moving average (EMA) has recently gained significant popularity in training modern deep learning models, especially diffusion-based generative models. However, there have been few theoretical results explaining the effectiveness of EMA. In this paper, to better understand EMA, we establish the risk bound of online SGD with EMA for high-dimensional linear regression, one of the simplest overparameterized learning tasks that shares similarities with neural networks. Our results indicate that (i) the variance error of SGD with EMA is always smaller than that of SGD without averaging, and (ii) unlike SGD with iterate averaging from the beginning, the bias error of SGD with EMA decays exponentially in every eigen-subspace of the data covariance matrix. Additionally, we develop proof techniques applicable to the analysis of a broad class of averaging schemes.

excess risk, inequality hold, sgd, (16 more...)

arXiv.org Machine Learning

2502.14123

Country: North America > United States > California > Los Angeles County > Los Angeles (0.28)

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Boone, Victor, Zhang, Zihan

arXiv.org Machine LearningJun-3-2024

In recent years, significant attention has been directed towards learning average-reward Markov Decision Processes (MDPs). However, existing algorithms either suffer from sub-optimal regret guarantees or computational inefficiencies. In this paper, we present the first tractable algorithm with minimax optimal regret of $\widetilde{\mathrm{O}}(\sqrt{\mathrm{sp}(h^*) S A T})$, where $\mathrm{sp}(h^*)$ is the span of the optimal bias function $h^*$, $S \times A$ is the size of the state-action space and $T$ the number of learning steps. Remarkably, our algorithm does not require prior information on $\mathrm{sp}(h^*)$. Our algorithm relies on a novel subroutine, Projected Mitigated Extended Value Iteration (PMEVI), to compute bias-constrained optimal policies efficiently. This subroutine can be applied to various previous algorithms to improve regret bounds.

algorithm, confidence region, inequality, (13 more...)

arXiv.org Machine Learning

2406.01234

Country:

Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
Asia > Japan > Honshū > Tōhoku (0.04)
North America > United States > Virginia > Arlington County > Arlington (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Search (0.61)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.46)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.46)

Add feedback

Sharp error estimates for target measure diffusion maps with applications to the committor problem

Sule, Shashank, Evans, Luke, Cameron, Maria

arXiv.org Artificial IntelligenceDec-21-2023

We obtain asymptotically sharp error estimates for the consistency error of the Target Measure Diffusion map (TMDmap) (Banisch et al. 2020), a variant of diffusion maps featuring importance sampling and hence allowing input data drawn from an arbitrary density. The derived error estimates include the bias error and the variance error. The resulting convergence rates are consistent with the approximation theory of graph Laplacians. The key novelty of our results lies in the explicit quantification of all the prefactors on leading-order terms. We also prove an error estimate for solutions of Dirichlet BVPs obtained using TMDmap, showing that the solution error is controlled by consistency error. We use these results to study an important application of TMDmap in the analysis of rare events in systems governed by overdamped Langevin dynamics using the framework of transition path theory (TPT). The cornerstone ingredient of TPT is the solution of the committor problem, a boundary value problem for the backward Kolmogorov PDE. Remarkably, we find that the TMDmap algorithm is particularly suited as a meshless solver to the committor problem due to the cancellation of several error terms in the prefactor formula. Furthermore, significant improvements in bias and variance errors occur when using a quasi-uniform sampling density. Our numerical experiments show that these improvements in accuracy are realizable in practice when using $\delta$-nets as spatially uniform inputs to the TMDmap algorithm.

artificial intelligence, machine learning, point cloud, (18 more...)

arXiv.org Artificial Intelligence

2312.14418

Country:

North America > United States > New York > New York County > New York City (0.14)
North America > United States > Maryland > Prince George's County > College Park (0.14)

Genre: Research Report > New Finding (0.48)

Industry: Energy > Oil & Gas > Upstream (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Mathematics of Computing (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
(2 more...)

Add feedback

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

Li, Xuheng, Deng, Yihe, Wu, Jingfeng, Zhou, Dongruo, Gu, Quanquan

arXiv.org Machine LearningNov-23-2023

Accelerated stochastic gradient descent (ASGD) is a workhorse in deep learning and often achieves better generalization performance than SGD. However, existing optimization theory can only explain the faster convergence of ASGD, but cannot explain its better generalization. In this paper, we study the generalization of ASGD for overparameterized linear regression, which is possibly the simplest setting of learning with overparameterization. We establish an instance-dependent excess risk bound for ASGD within each eigen-subspace of the data covariance matrix. Our analysis shows that (i) ASGD outperforms SGD in the subspace of small eigenvalues, exhibiting a faster rate of exponential decay for bias error, while in the subspace of large eigenvalues, its bias error decays slower than SGD; and (ii) the variance error of ASGD is always larger than that of SGD. Our result suggests that ASGD can outperform SGD when the difference between the initialization and the true weight vector is mostly confined to the subspace of small eigenvalues. Additionally, when our analysis is specialized to linear regression in the strongly convex setting, it yields a tighter bound for bias error than the best-known result.

artificial intelligence, inequality hold, machine learning, (16 more...)

arXiv.org Machine Learning

2311.14222

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > California > Alameda County > Berkeley (0.14)
North America > United States > Indiana > Monroe County > Bloomington (0.04)
(2 more...)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.81)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.54)

Add feedback

PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method

Guan, Ziwei, Xu, Tengyu, Liang, Yingbin

arXiv.org Machine LearningOct-13-2021

As a major value function evaluation method, temporal difference (TD) learning (Sutton, 1988; Dayan, 1992) has been widely used in various planning problems in reinforcement learning. Although TD learning performs successfully in the on-policy settings, where an agent can interact with environments under the target policy, it can perform poorly or even diverge under the off-policy settings when the agent only has access to data sampled by a behavior policy (Baird, 1995; Tsitsiklis and Van Roy, 1997; Mahmood et al., 2015). To address such an issue, the gradient temporal-difference (GTD) (Sutton et al., 2008) and least-squares temporal difference (LSTD) (Yu, 2010) algorithms have been proposed, which have been shown to converge in the off-policy settings. However, since GTD and LSTD consider an objective function based on the behavior policy, their converging points can be largely biased from the true value function due to the distribution mismatch between the target and behavior policies, even when the express power of the function approximation class is arbitrarily large (Kolter, 2011). In order to provide a more accurate evaluation, Sutton et al. (2016) proposed the emphatic temporal difference (ETD) algorithm, which introduces the follow-on trace to address the distribution mismatch issue. The stability of ETD was then shown in Sutton et al. (2016); Mahmood et al. (2015), and the asymptotic convergence guarantee for ETD was established in Yu (2015), it has also achieved great success in many tasks (Ghiassian et al., 2016; Ni, 2021).

algorithm, max 1, per-etd, (12 more...)

arXiv.org Machine Learning

2110.06906

Country:

North America > Canada > Alberta (0.14)
North America > United States > Ohio (0.04)
Asia > Middle East > Jordan (0.04)

Genre: Research Report (0.63)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

Wu, Jingfeng, Zou, Difan, Braverman, Vladimir, Gu, Quanquan, Kakade, Sham M.

arXiv.org Machine LearningOct-12-2021

Stochastic gradient descent (SGD) has been demonstrated to generalize well in many deep learning applications. In practice, one often runs SGD with a geometrically decaying stepsize, i.e., a constant initial stepsize followed by multiple geometric stepsize decay, and uses the last iterate as the output. This kind of SGD is known to be nearly minimax optimal for classical finite-dimensional linear regression problems (Ge et al., 2019), and provably outperforms SGD with polynomially decaying stepsize in terms of the statistical minimax rates. However, a sharp analysis for the last iterate of SGD with decaying step size in the overparameterized setting is still open. In this paper, we provide problem-dependent analysis on the last iterate risk bounds of SGD with decaying stepsize, for (overparameterized) linear regression problems. In particular, for SGD with geometrically decaying stepsize (or tail geometrically decaying stepsize), we prove nearly matching upper and lower bounds on the excess risk. Our results demonstrate the generalization ability of SGD for a wide class of overparameterized problems, and can recover the minimax optimal results up to logarithmic factors in the classical regime. Moreover, we provide an excess risk lower bound for SGD with polynomially decaying stepsize and illustrate the advantage of geometrically decaying stepsize in an instance-wise manner, which complements the minimax rate comparison made in previous work.

geometrically, sgd, stepsize, (15 more...)

arXiv.org Machine Learning

2110.06198

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
North America > United States > Washington > King County > Seattle (0.14)
North America > United States > Maryland > Baltimore (0.04)

Genre: Research Report > New Finding (0.86)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (1.00)

Add feedback

5 Simple Tips to Supercharge your Machine Learning Practice

#artificialintelligenceAug-12-2021, 05:55:34 GMT

Since I was in high school, I've had this weird obsession of squeeze the key concepts of everything that I learn in one page. Looking back, that was probably my lazy mind's way to get away with the least amount of required work to pass an exam…but interestingly that abstraction effort also helped a lot to learn those concepts in a deeper level and to remember them longer. Nowadays when I teach Machine Learning, I try to teach it in two parallel tracks: a) main concepts and b) methods and theoretical details, and make sure my students can look at each new method through the lens of the same concepts. Recently I got a chance to read "Machine Learning Yearning" by Andrew Ng, which seemed to be his version of abstracting some of the practical ML concepts without getting into any formula or implementation details. While they can see so simple and obvious, as an ML engineer I can attest that losing sight of those simple tips are among the most common causes for an ML research to fail in production, and being mindful of them is what distinguishes a good data science work from a mediocre one.

available data, bias error, variance error, (14 more...)

#artificialintelligence

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.30)

Add feedback

Filters

Collaborating Authors

bias error

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

ImprovingSampleComplexityBoundsfor(Natural) Actor-CriticAlgorithms

2e1b24a664f5e9c18f407b2f9c73e821-Paper.pdf

Risk Comparisons in Linear Regression: Implicit Regularization Dominates Explicit Regularization

Understanding SGD with Exponential Moving Average: A Case Study in Linear Regression

Achieving Tractable Minimax Optimal Regret in Average Reward MDPs

Sharp error estimates for target measure diffusion maps with applications to the committor problem

Risk Bounds of Accelerated SGD for Overparameterized Linear Regression

PER-ETD: A Polynomially Efficient Emphatic Temporal Difference Learning Method

Last Iterate Risk Bounds of SGD with Decaying Stepsize for Overparameterized Linear Regression

5 Simple Tips to Supercharge your Machine Learning Practice